
Multi data path config can cause a shard to be perceived as corrupted #4674

Closed
kimchy opened this issue Jan 9, 2014 · 3 comments

kimchy commented Jan 9, 2014

With a multi data path config, each file is written to a data location chosen by available size (by default). Lucene's segments.gen file always has the same name, and for that one file we need to make sure it is always written to the same data location; otherwise the index can end up with multiple segments.gen files, and the shard can appear to be corrupted.

When this happens, the error message is that a segments_xxx file was not found, and a find for segments.gen across the data locations can yield multiple files. Deleting the segments.gen files will cause the shard to recover properly, since the file is only an extra protection layer Lucene uses to resolve the segments header.
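For illustration, a minimal Java sketch of the idea behind the fix (the class and method names here are hypothetical, not the actual Elasticsearch distributor code): ordinary index files go to the data path with the most usable space, while segments.gen is pinned to one fixed path.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the actual Elasticsearch code: ordinary index
// files go to the data path with the most usable space, but segments.gen
// is pinned to one fixed path so stale copies can never accumulate
// across data locations.
class DataPathPicker {
    private final List<Path> dataPaths;

    DataPathPicker(List<Path> dataPaths) {
        this.dataPaths = dataPaths;
    }

    Path pathFor(String fileName) {
        if ("segments.gen".equals(fileName)) {
            // Always the first configured path, regardless of free space.
            return dataPaths.get(0);
        }
        // Default policy: pick the path with the most usable space.
        return dataPaths.stream()
                .max(Comparator.comparingLong(this::usableSpace))
                .orElseThrow(IllegalStateException::new);
    }

    private long usableSpace(Path path) {
        try {
            return Files.getFileStore(path).getUsableSpace();
        } catch (IOException e) {
            return 0L; // treat unreadable stores as full
        }
    }
}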

kimchy added a commit to kimchy/elasticsearch that referenced this issue Jan 9, 2014

Make sure the segments.gen file is written to the same directory every time
fixes elastic#4674
kimchy closed this as completed in da680be on Jan 9, 2014
kimchy added a commit that referenced this issue Jan 9, 2014

Make sure the segments.gen file is written to the same directory every time
fixes #4674

sbarton commented Jan 15, 2014

Hi,

I was using ES 0.90.7, and after a restart of the cluster my shards were going missing. Physically, I can still see the files on disk (in 6 cases out of 8), but the shards won't come up, with the indication that a segments_X file is missing. I followed your advice of removing the segments.gen file in order to recover, but the shards are not coming up. Even after restarting the cluster the shards are still not coming up; the error is now:

[2014-01-15 18:46:00,504][DEBUG][cluster.service ] [Synch] processing [shard-failed ([1millionnewv2][4], node[fWigetX5QNar2zIpvkEK_Q], [P], s[INITIALIZING]), reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[1millionnewv2][4] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[1millionnewv2][4] shard allocated for local recovery (post api), should exist, but doesn't, current files: [_s41.nvd, _twd_es090_0.tip, _um4.nvm,... ]]; nested: IndexNotFoundException[no segments* file found in store(least_used[rate_limited(niofs(/0/elasticsearch/elasticsearch/nodes/0/indices/1millionnewv2/4/index), type=MERGE, rate=20.0), rate_limited(niofs(/1/elasticsearch/elasticsearch/nodes/0/indices/1millionnewv2/4/index), type=MERGE, rate=20.0), rate_limited(niofs(/2/elasticsearch/elasticsearch/nodes/0/indices/1millionnewv2/4/index), type=MERGE, rate=20.0), rate_limited(niofs(/3/elasticsearch/elasticsearch/nodes/0/indices/1millionnewv2/4/index), type=MERGE, rate=20.0)]): files: [ .... ] ]]]: no change in cluster_state

What can I do if the usual strategy of removing the segments.gen file doesn't work? I have even updated ES to the latest 0.90.10 version, but the restart problems are still the same and the shards are not coming up. I have lost a considerable amount of data in the 2 shards that were wiped out, but I could still save a lot of data in the 6 shards I could recover.

brusic pushed a commit to brusic/elasticsearch that referenced this issue Jan 19, 2014

Make sure the segments.gen file is written to the same directory every time
fixes elastic#4674

kimchy commented Jan 20, 2014

@sbarton are you sure you deleted the segments.gen from all data directories for the relevant shard ([1millionnewv2][4]) (or, better yet, just delete it recursively across all data locations)? The failure suggests you potentially didn't.
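For anyone hitting this, a one-off cleanup sketch in Java (JDK 8+) that deletes every segments.gen recursively under each data path, roughly equivalent to a find -name segments.gen -delete across the data locations. The root paths below are examples taken from the log above; adjust them to your own path.data settings, and stop the node before running it.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// One-off cleanup sketch: recursively delete every segments.gen under each
// configured data path. The roots below are examples from the log above;
// adjust them to your own path.data settings.
public class PurgeSegmentsGen {
    public static void main(String[] args) throws IOException {
        String[] roots = {"/0/elasticsearch", "/1/elasticsearch",
                          "/2/elasticsearch", "/3/elasticsearch"};
        for (String root : roots) {
            try (Stream<Path> files = Files.walk(Paths.get(root))) {
                files.filter(p -> "segments.gen".equals(p.getFileName().toString()))
                     .forEach(p -> {
                         try {
                             Files.delete(p);
                             System.out.println("deleted " + p);
                         } catch (IOException e) {
                             System.err.println("failed to delete " + p + ": " + e);
                         }
                     });
            }
        }
    }
}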


sbarton commented Feb 6, 2014

@kimchy I made sure I removed the segments.gen from all data directories (using the find command), but still no luck. In the end I gave up (I had more shards in the same situation on other machines; I tried the same approach, but none of them came back) and re-indexed the whole thing once again. But I can say that the 0.90.10 version is not losing shards even after several harsh restarts.

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

Make sure the segments.gen file is written to the same directory every time
fixes elastic#4674